Computer-assisted pronunciation training—Speech synthesis is almost all you need
Abstract
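The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers have focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%–80% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of machine learning models for detecting pronunciation errors, but also help establish a new state-of-the-art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be the first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed in the tasks of detecting pronunciation and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors in the AUC metric by 41%, from 0.528 to 0.749, compared to the state-of-the-art approach.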
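As a rough illustration of the P2P idea mentioned in the abstract, the sketch below perturbs a canonical phoneme sequence with random substitutions and records which positions were changed, so the output could serve as labeled training data. It is a minimal sketch under stated assumptions, not the authors' implementation: the phoneme inventory, the function names `p2p_perturb` and `relative_improvement`, and the `error_rate` parameter are all hypothetical, and only substitutions are modeled. The last line checks the arithmetic behind the reported AUC gain: (0.749 - 0.528) / 0.528 is roughly 0.42, which the abstract reports as 41%.

```python
import random

# Hypothetical, tiny phoneme inventory (ARPAbet-like symbols), for illustration only.
PHONEMES = ["AH", "IY", "EH", "T", "D", "S", "Z", "K", "R", "L"]


def p2p_perturb(phonemes, error_rate=0.3, rng=None):
    """Return (perturbed_sequence, labels); label 1 marks a substituted phoneme.

    Assumption: mispronunciations are simulated by random substitution only.
    """
    rng = rng or random.Random(0)
    perturbed, labels = [], []
    for ph in phonemes:
        if rng.random() < error_rate:
            # Swap in a different phoneme to simulate a pronunciation error.
            perturbed.append(rng.choice([p for p in PHONEMES if p != ph]))
            labels.append(1)
        else:
            perturbed.append(ph)
            labels.append(0)
    return perturbed, labels


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline


canonical = ["K", "AH", "T"]  # hypothetical canonical pronunciation
noisy, labels = p2p_perturb(canonical, rng=random.Random(42))
print(noisy, labels)
print(f"{relative_improvement(0.528, 0.749):.3f}")
```

Pairs of `(noisy, labels)` generated this way could be fed to any sequence classifier; the T2S and S2S techniques named in the abstract operate on audio rather than symbols and are not covered by this sketch.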
Journal
Journal title: Speech Communication
Year: 2022
ISSN: 1872-7182, 0167-6393
DOI: https://doi.org/10.1016/j.specom.2022.06.003